EXPLORE & SUMMARIZE DATA | White Wine Quality Analysis
Aurore Dupont
========================================================
In this project, we explore a dataset on the quality of white wines, using exploratory data analysis techniques in R to examine variables individually and then the relationships among several variables.
This dataset is publicly available for research. The details are described in [Cortez et al., 2009].
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Input variables (based on physicochemical tests):
- Fixed acidity (tartaric acid - g / dm^3): Most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
- Volatile acidity (acetic acid - g / dm^3): The amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste
- Citric acid (g / dm^3): Found in small quantities, citric acid can add ‘freshness’ and flavor to wines
- Residual sugar (g / dm^3): The amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
- Chlorides (sodium chloride - g / dm^3): The amount of salt in the wine
- Free sulfur dioxide (mg / dm^3): The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
- Total sulfur dioxide (mg / dm^3): Amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
- Density (g / cm^3): The density of wine is close to that of water, depending on the percent alcohol and sugar content
- pH: Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale
- Sulphates (potassium sulphate - g / dm^3): A wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
- Alcohol (% by volume): The percent alcohol content of the wine
Output variable (based on sensory data):
- Quality: Score between 0 and 10
Several of the attributes may be correlated, so it makes sense to apply some form of feature selection.
We start with a preliminary exploration of the dataset.
Summaries of the data and univariate plots will allow us to understand the structure of the individual variables in the dataset.
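The code that produced the structural summaries that follow is not shown. A minimal sketch of the kind of calls involved, assuming the data were read from a CSV file into a data frame named `wine` (the name used later in the model calls). The file name in the comment is hypothetical, and a three-row stand-in built from the first rows of the output is used so the snippet runs on its own:

```r
# Assumption: the CSV file name is hypothetical; point it at your own copy.
# wine <- read.csv("wineQualityWhites.csv")

# Stand-in data frame: the first three rows shown in the output.
wine <- data.frame(
  X                    = 1:3,
  fixed.acidity        = c(7.0, 6.3, 8.1),
  volatile.acidity     = c(0.27, 0.30, 0.28),
  citric.acid          = c(0.36, 0.34, 0.40),
  residual.sugar       = c(20.7, 1.6, 6.9),
  chlorides            = c(0.045, 0.049, 0.050),
  free.sulfur.dioxide  = c(45, 14, 30),
  total.sulfur.dioxide = c(170, 132, 97),
  density              = c(1.0010, 0.9940, 0.9951),
  pH                   = c(3.00, 3.30, 3.26),
  sulphates            = c(0.45, 0.49, 0.44),
  alcohol              = c(8.8, 9.5, 10.1),
  quality              = c(6L, 6L, 6L)
)

head(wine, 3)   # first rows
dim(wine)       # number of rows and columns
names(wine)     # column names
str(wine)       # type and leading values of each column

# Assumption: the per-variable table with n, mean, sd, skew, kurtosis, etc.
# resembles psych::describe() output; uncomment if psych is installed.
# psych::describe(wine)
```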
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.0 0.27 0.36 20.7 0.045
## 2 2 6.3 0.30 0.34 1.6 0.049
## 3 3 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## [1] 4898 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## vars n mean sd median trimmed mad
## X 1 4898 2449.50 1414.08 2449.50 2449.50 1815.44
## fixed.acidity 2 4898 6.85 0.84 6.80 6.82 0.74
## volatile.acidity 3 4898 0.28 0.10 0.26 0.27 0.09
## citric.acid 4 4898 0.33 0.12 0.32 0.33 0.09
## residual.sugar 5 4898 6.39 5.07 5.20 5.80 5.34
## chlorides 6 4898 0.05 0.02 0.04 0.04 0.01
## free.sulfur.dioxide 7 4898 35.31 17.01 34.00 34.36 16.31
## total.sulfur.dioxide 8 4898 138.36 42.50 134.00 136.96 43.00
## density 9 4898 0.99 0.00 0.99 0.99 0.00
## pH 10 4898 3.19 0.15 3.18 3.18 0.15
## sulphates 11 4898 0.49 0.11 0.47 0.48 0.10
## alcohol 12 4898 10.51 1.23 10.40 10.43 1.48
## quality 13 4898 5.88 0.89 6.00 5.85 1.48
## min max range skew kurtosis se
## X 1.00 4898.00 4897.00 0.00 -1.20 20.21
## fixed.acidity 3.80 14.20 10.40 0.65 2.17 0.01
## volatile.acidity 0.08 1.10 1.02 1.58 5.08 0.00
## citric.acid 0.00 1.66 1.66 1.28 6.16 0.00
## residual.sugar 0.60 65.80 65.20 1.08 3.46 0.07
## chlorides 0.01 0.35 0.34 5.02 37.51 0.00
## free.sulfur.dioxide 2.00 289.00 287.00 1.41 11.45 0.24
## total.sulfur.dioxide 9.00 440.00 431.00 0.39 0.57 0.61
## density 0.99 1.04 0.05 0.98 9.78 0.00
## pH 2.72 3.82 1.10 0.46 0.53 0.00
## sulphates 0.22 1.08 0.86 0.98 1.59 0.00
## alcohol 8.00 14.20 6.20 0.49 -0.70 0.02
## quality 3.00 9.00 6.00 0.16 0.21 0.01
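From this point on, the output shows the index column as `wine_id` rather than `X`. The rename step itself is not shown; one base R way to do it, sketched on a two-column stand-in data frame:

```r
# Stand-in with just the index column and one other column.
wine <- data.frame(X = 1:3, quality = c(6L, 6L, 6L))

# Rename the column called "X" to "wine_id", leaving the others untouched.
names(wine)[names(wine) == "X"] <- "wine_id"

names(wine)
```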
## wine_id fixed.acidity volatile.acidity citric.acid residual.sugar
## 1 1 7.0 0.27 0.36 20.7
## 2 2 6.3 0.30 0.34 1.6
## 3 3 8.1 0.28 0.40 6.9
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1 0.045 45 170 1.0010 3.00
## 2 0.049 14 132 0.9940 3.30
## 3 0.050 30 97 0.9951 3.26
## sulphates alcohol quality
## 1 0.45 8.8 6
## 2 0.49 9.5 6
## 3 0.44 10.1 6
## wine_id fixed.acidity volatile.acidity citric.acid residual.sugar
## 1 1 7.0 0.27 0.36 20.70
## 2 2 6.3 0.30 0.34 1.60
## 3 3 8.1 0.28 0.40 6.90
## 4 4 7.2 0.23 0.32 8.50
## 5 5 7.2 0.23 0.32 8.50
## 6 6 8.1 0.28 0.40 6.90
## 7 7 6.2 0.32 0.16 7.00
## 8 8 7.0 0.27 0.36 20.70
## 9 9 6.3 0.30 0.34 1.60
## 10 10 8.1 0.22 0.43 1.50
## 11 11 8.1 0.27 0.41 1.45
## 12 12 8.6 0.23 0.40 4.20
## 13 13 7.9 0.18 0.37 1.20
## 14 14 6.6 0.16 0.40 1.50
## 15 15 8.3 0.42 0.62 19.25
## 16 16 6.6 0.17 0.38 1.50
## 17 17 6.3 0.48 0.04 1.10
## 18 18 6.2 0.66 0.48 1.20
## 19 19 7.4 0.34 0.42 1.10
## 20 20 6.5 0.31 0.14 7.50
## 21 21 6.2 0.66 0.48 1.20
## 22 22 6.4 0.31 0.38 2.90
## 23 23 6.8 0.26 0.42 1.70
## 24 24 7.6 0.67 0.14 1.50
## 25 25 6.6 0.27 0.41 1.30
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1 0.045 45 170 1.0010 3.00
## 2 0.049 14 132 0.9940 3.30
## 3 0.050 30 97 0.9951 3.26
## 4 0.058 47 186 0.9956 3.19
## 5 0.058 47 186 0.9956 3.19
## 6 0.050 30 97 0.9951 3.26
## 7 0.045 30 136 0.9949 3.18
## 8 0.045 45 170 1.0010 3.00
## 9 0.049 14 132 0.9940 3.30
## 10 0.044 28 129 0.9938 3.22
## 11 0.033 11 63 0.9908 2.99
## 12 0.035 17 109 0.9947 3.14
## 13 0.040 16 75 0.9920 3.18
## 14 0.044 48 143 0.9912 3.54
## 15 0.040 41 172 1.0002 2.98
## 16 0.032 28 112 0.9914 3.25
## 17 0.046 30 99 0.9928 3.24
## 18 0.029 29 75 0.9892 3.33
## 19 0.033 17 171 0.9917 3.12
## 20 0.044 34 133 0.9955 3.22
## 21 0.029 29 75 0.9892 3.33
## 22 0.038 19 102 0.9912 3.17
## 23 0.049 41 122 0.9930 3.47
## 24 0.074 25 168 0.9937 3.05
## 25 0.052 16 142 0.9951 3.42
## sulphates alcohol quality level
## 1 0.45 8.8 6 Medium
## 2 0.49 9.5 6 Medium
## 3 0.44 10.1 6 Medium
## 4 0.40 9.9 6 Medium
## 5 0.40 9.9 6 Medium
## 6 0.44 10.1 6 Medium
## 7 0.47 9.6 6 Medium
## 8 0.45 8.8 6 Medium
## 9 0.49 9.5 6 Medium
## 10 0.45 11.0 6 Medium
## 11 0.56 12.0 5 Medium
## 12 0.53 9.7 5 Medium
## 13 0.63 10.8 5 Medium
## 14 0.52 12.4 7 Medium
## 15 0.67 9.7 5 Medium
## 16 0.55 11.4 7 Medium
## 17 0.36 9.6 6 Medium
## 18 0.39 12.8 8 High
## 19 0.53 11.3 6 Medium
## 20 0.50 9.5 5 Medium
## 21 0.39 12.8 8 High
## 22 0.35 11.0 7 Medium
## 23 0.48 10.5 8 High
## 24 0.51 9.3 5 Medium
## 25 0.47 10.0 6 Medium
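The `level` column groups quality scores into three ordered categories, but the code that creates it is not shown. A sketch with `cut()`, assuming the intervals (0, 3], (3, 7] and (7, 10]: the upper cut matches the rows above, where 7 maps to Medium and 8 to High, while the lower cut is a guess. The quality values are those of the 25 rows above.

```r
# Quality scores of the 25 rows listed above.
quality <- c(6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5,
             7, 5, 7, 6, 8, 6, 5, 8, 7, 8, 5, 6)

# Bin the scores into an ordered factor.
level <- cut(
  quality,
  breaks = c(0, 3, 7, 10),               # assumed cut points
  labels = c("Low", "Medium", "High"),
  ordered_result = TRUE
)

table(level)
```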
Now, let’s plot some univariate data:
## wine_id fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality level
## Min. : 8.00 Min. :3.000 Low : 20
## 1st Qu.: 9.50 1st Qu.:5.000 Medium:4698
## Median :10.40 Median :6.000 High : 180
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
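A univariate view of a single variable can be sketched with base R, here on the alcohol values of the 25 rows listed earlier. The plotting code behind the original figures is not shown, so `hist()`, the bin width of 0.5, and the labels are illustrative assumptions:

```r
# Alcohol content of the 25 rows listed earlier.
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11.0,
             12.0, 9.7, 10.8, 12.4, 9.7, 11.4, 9.6, 12.8, 11.3, 9.5,
             12.8, 11.0, 10.5, 9.3, 10.0)

# Five-number summary plus mean.
summary(alcohol)

# Histogram with fixed-width bins covering the observed range of the dataset.
h <- hist(
  alcohol,
  breaks = seq(8, 14.5, by = 0.5),   # assumed bin width
  main   = "Alcohol content (25-row sample)",
  xlab   = "Alcohol (% by volume)"
)

h$counts   # number of wines per bin
```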
As noted earlier, some of the variables may be related. To examine the correlations, we build a correlation matrix. The table below gives a rule of thumb for interpreting the strength of a correlation coefficient:
##          Positive     Negative
## Small    .1 to .3     -0.1 to -0.3
## Medium   .3 to .5     -0.3 to -0.5
## Large    .5 to 1.0    -0.5 to -1.0
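A correlation matrix and the strength bands above can be sketched in base R as follows. The data are five columns from the 25 rows listed earlier, so the coefficients will differ from those of the full dataset. The helper `strength()` and its "Negligible" label for |r| below 0.1 are hypothetical additions, and a boundary value such as 0.3 is assigned to the higher band.

```r
# Five columns of the 25 rows listed earlier.
wine <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11.0,
              12.0, 9.7, 10.8, 12.4, 9.7, 11.4, 9.6, 12.8, 11.3, 9.5,
              12.8, 11.0, 10.5, 9.3, 10.0),
  density = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951, 0.9949,
              1.0010, 0.9940, 0.9938, 0.9908, 0.9947, 0.9920, 0.9912,
              1.0002, 0.9914, 0.9928, 0.9892, 0.9917, 0.9955, 0.9892,
              0.9912, 0.9930, 0.9937, 0.9951),
  residual.sugar = c(20.70, 1.60, 6.90, 8.50, 8.50, 6.90, 7.00, 20.70,
                     1.60, 1.50, 1.45, 4.20, 1.20, 1.50, 19.25, 1.50,
                     1.10, 1.20, 1.10, 7.50, 1.20, 2.90, 1.70, 1.50, 1.30),
  pH = c(3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18, 3.00, 3.30, 3.22,
         2.99, 3.14, 3.18, 3.54, 2.98, 3.25, 3.24, 3.33, 3.12, 3.22,
         3.33, 3.17, 3.47, 3.05, 3.42),
  quality = c(6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5,
              7, 5, 7, 6, 8, 6, 5, 8, 7, 8, 5, 6)
)

# Pairwise Pearson correlations between all columns.
corr <- cor(wine, method = "pearson")
round(corr, 2)

# Hypothetical helper: map a coefficient to the strength bands in the table.
strength <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.1, 0.3, 0.5, 1),
      labels = c("Negligible", "Small", "Medium", "Large"),
      right = FALSE, include.lowest = TRUE)
}

strength(corr["alcohol", "density"])

# Assumption: corrplot is installed; draws the matrix as a plot.
# corrplot::corrplot(corr, method = "circle")
```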
Now, we analyze quality against some of the other variables.
Let’s take a look at the relationships between the following variables:
- alcohol & pH,
- alcohol & density,
- alcohol & residual sugar.
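For the three pairs listed above, a correlation test adds a p-value to each coefficient. A sketch using base R's `cor.test()` on the 25 rows listed earlier; the list name `pairs_to_test` and the layout of the result table are illustrative, and the values will differ from those of the full dataset:

```r
# Four columns of the 25 rows listed earlier.
wine <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11.0,
              12.0, 9.7, 10.8, 12.4, 9.7, 11.4, 9.6, 12.8, 11.3, 9.5,
              12.8, 11.0, 10.5, 9.3, 10.0),
  pH = c(3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18, 3.00, 3.30, 3.22,
         2.99, 3.14, 3.18, 3.54, 2.98, 3.25, 3.24, 3.33, 3.12, 3.22,
         3.33, 3.17, 3.47, 3.05, 3.42),
  density = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951, 0.9949,
              1.0010, 0.9940, 0.9938, 0.9908, 0.9947, 0.9920, 0.9912,
              1.0002, 0.9914, 0.9928, 0.9892, 0.9917, 0.9955, 0.9892,
              0.9912, 0.9930, 0.9937, 0.9951),
  residual.sugar = c(20.70, 1.60, 6.90, 8.50, 8.50, 6.90, 7.00, 20.70,
                     1.60, 1.50, 1.45, 4.20, 1.20, 1.50, 19.25, 1.50,
                     1.10, 1.20, 1.10, 7.50, 1.20, 2.90, 1.70, 1.50, 1.30)
)

# The variable pairs named in the list above.
pairs_to_test <- list(
  c("alcohol", "pH"),
  c("alcohol", "density"),
  c("alcohol", "residual.sugar")
)

# Run a Pearson correlation test per pair and collect r and the p-value.
results <- do.call(rbind, lapply(pairs_to_test, function(p) {
  ct <- cor.test(wine[[p[1]]], wine[[p[2]]], method = "pearson")
  data.frame(x = p[1], y = p[2],
             r = unname(ct$estimate),
             p.value = ct$p.value)
}))

results
```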
Now, let’s take a look at our linear models, which predict quality and add one predictor at a time.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + sulphates, data = wine)
## m3: lm(formula = quality ~ alcohol + sulphates + pH, data = wine)
## m4: lm(formula = quality ~ alcohol + sulphates + pH + density, data = wine)
## m5: lm(formula = quality ~ alcohol + sulphates + pH + density + free.sulfur.dioxide,
## data = wine)
## m6: lm(formula = quality ~ alcohol + sulphates + pH + density + free.sulfur.dioxide +
## total.sulfur.dioxide, data = wine)
## m7: lm(formula = quality ~ alcohol + sulphates + pH + density + free.sulfur.dioxide +
## total.sulfur.dioxide + chlorides, data = wine)
## m8: lm(formula = quality ~ alcohol + sulphates + pH + density + free.sulfur.dioxide +
## total.sulfur.dioxide + chlorides + residual.sugar, data = wine)
## m9: lm(formula = quality ~ alcohol + sulphates + pH + density + free.sulfur.dioxide +
## total.sulfur.dioxide + chlorides + residual.sugar + citric.acid,
## data = wine)
## m10: lm(formula = quality ~ alcohol + sulphates + pH + density + free.sulfur.dioxide +
## total.sulfur.dioxide + chlorides + residual.sugar + citric.acid +
## volatile.acidity, data = wine)
## m11: lm(formula = quality ~ alcohol + sulphates + pH + density + free.sulfur.dioxide +
## total.sulfur.dioxide + chlorides + residual.sugar + citric.acid +
## volatile.acidity + fixed.acidity, data = wine)
##
## =====================================================================================================================================================================
##                       m1           m2           m3           m4           m5           m6           m7           m8           m9           m10          m11
## ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept)           2.582***     2.341***     1.683***     -20.991***   -12.729*     -23.194***   -21.332***   111.361***   122.034***   109.087***   150.193***
##                       (0.098)      (0.110)      (0.250)      (6.181)      (6.207)      (6.413)      (6.419)      (13.579)     (13.863)     (13.507)     (18.804)
## alcohol               0.313***     0.314***     0.311***     0.353***     0.358***     0.352***     0.336***     0.202***     0.188***     0.239***     0.193***
##                       (0.009)      (0.009)      (0.009)      (0.015)      (0.015)      (0.015)      (0.015)      (0.019)      (0.020)      (0.019)      (0.024)
## sulphates                          0.476***     0.429***     0.392***     0.360***     0.425***     0.432***     0.670***     0.659***     0.576***     0.631***
##                                    (0.100)      (0.101)      (0.101)      (0.101)      (0.101)      (0.101)      (0.102)      (0.102)      (0.099)      (0.100)
## pH                                              0.225**      0.229**      0.213**      0.233**      0.215**      0.488***     0.552***     0.466***     0.686***
##                                                 (0.077)      (0.077)      (0.076)      (0.076)      (0.076)      (0.079)      (0.081)      (0.079)      (0.105)
## density                                                      22.367***    13.858*      24.578***    23.015***    -110.494***  -121.415***  -108.112***  -150.284***
##                                                              (6.093)      (6.125)      (6.345)      (6.347)      (13.611)     (13.909)     (13.552)     (19.075)
## free.sulfur.dioxide                                                       0.006***     0.009***     0.009***     0.007***     0.006***     0.004***     0.004***
##                                                                           (0.001)      (0.001)      (0.001)      (0.001)      (0.001)      (0.001)      (0.001)
## total.sulfur.dioxide                                                                   -0.002***    -0.002***    -0.002***    -0.002***    -0.000       -0.000
##                                                                                        (0.000)      (0.000)      (0.000)      (0.000)      (0.000)      (0.000)
## chlorides                                                                                           -2.235***    -1.456**     -1.594**     -0.540       -0.247
##                                                                                                     (0.552)      (0.550)      (0.550)      (0.539)      (0.547)
## residual.sugar                                                                                                   0.062***     0.066***     0.066***     0.081***
##                                                                                                                  (0.006)      (0.006)      (0.006)      (0.008)
## citric.acid                                                                                                                   0.356***     0.059        0.022
##                                                                                                                               (0.096)      (0.095)      (0.096)
## volatile.acidity                                                                                                                           -1.896***    -1.863***
##                                                                                                                                            (0.113)      (0.114)
## fixed.acidity                                                                                                                                           0.066**
##                                                                                                                                                         (0.021)
## ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared             0.190        0.193        0.195        0.197        0.209        0.215        0.218        0.237        0.239        0.280        0.282
## adj. R-squared        0.190        0.193        0.194        0.196        0.209        0.214        0.217        0.236        0.238        0.279        0.280
## sigma                 0.797        0.796        0.795        0.794        0.788        0.785        0.784        0.774        0.773        0.752        0.751
## F                     1146.395     587.145      394.902      300.301      259.103      223.866      194.831      189.967      170.827      190.448      174.344
## p                     0.000        0.000        0.000        0.000        0.000        0.000        0.000        0.000        0.000        0.000        0.000
## Log-likelihood        -5839.391    -5828.015    -5823.718    -5816.981    -5779.268    -5760.359    -5752.163    -5691.736    -5684.858    -5548.674    -5543.740
## Deviance              3112.257     3097.833     3092.402     3083.907     3036.780     3013.424     3003.356     2930.157     2921.939     2763.891     2758.329
## AIC                   11684.782    11664.029    11657.436    11645.962    11572.535    11536.719    11522.326    11403.472    11391.715    11121.347    11113.480
## BIC                   11704.272    11690.016    11689.918    11684.942    11618.011    11588.691    11580.795    11468.438    11463.177    11199.306    11197.936
## N                     4898         4898         4898         4898         4898         4898         4898         4898         4898         4898         4898
## =====================================================================================================================================================================
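The eleven models above are nested: each adds one predictor to the previous one. A sketch of building such a sequence with `update()` and comparing fit statistics, limited to the first three models and run on the 25 rows listed earlier, so the numbers will not match the table. The side-by-side layout of the table resembles `memisc::mtable()` output; that attribution is an assumption, as is the name `comparison`.

```r
# Four columns of the 25 rows listed earlier.
wine <- data.frame(
  quality = c(6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5,
              7, 5, 7, 6, 8, 6, 5, 8, 7, 8, 5, 6),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11.0,
              12.0, 9.7, 10.8, 12.4, 9.7, 11.4, 9.6, 12.8, 11.3, 9.5,
              12.8, 11.0, 10.5, 9.3, 10.0),
  sulphates = c(0.45, 0.49, 0.44, 0.40, 0.40, 0.44, 0.47, 0.45, 0.49,
                0.45, 0.56, 0.53, 0.63, 0.52, 0.67, 0.55, 0.36, 0.39,
                0.53, 0.50, 0.39, 0.35, 0.48, 0.51, 0.47),
  pH = c(3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18, 3.00, 3.30, 3.22,
         2.99, 3.14, 3.18, 3.54, 2.98, 3.25, 3.24, 3.33, 3.12, 3.22,
         3.33, 3.17, 3.47, 3.05, 3.42)
)

# Start from one predictor and add one term per step.
m1 <- lm(quality ~ alcohol, data = wine)
m2 <- update(m1, ~ . + sulphates)
m3 <- update(m2, ~ . + pH)

# Collect the fit statistics reported at the bottom of the table.
models <- list(m1 = m1, m2 = m2, m3 = m3)
comparison <- data.frame(
  model         = names(models),
  r.squared     = sapply(models, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC           = sapply(models, AIC),
  BIC           = sapply(models, BIC),
  row.names     = NULL
)

comparison

# Assumption: memisc is installed; prints the models side by side.
# memisc::mtable(m1, m2, m3)
```

R-squared never decreases as terms are added, which is why adjusted R-squared, AIC, and BIC are more useful for deciding whether an extra predictor earns its place.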
Even the full model (m11) explains only about 28% of the variance in quality (R-squared of 0.282). It is likely that with more variables, such as grape type, region, weather (rain and sun affect quality), expertise, organic or not, fermentation time, and type of barrel, a more thorough analysis could have been conducted.
https://s3.amazonaws.com/content.udacity-data.com/courses/ud651/diamondsExample_2016-05.html
http://rprogramming.net/rename-columns-in-r/
http://www.cookbook-r.com/Manipulating_data/Adding_and_removing_columns_from_a_data_frame/
https://stackoverflow.com/questions/19440069/ggplot2-facet-wrap-strip-color-based-on-variable-in-data-set
https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
https://stackoverflow.com/questions/21140798/error-using-corrplot/21141799
http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r
https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php
https://www.cyclismo.org/tutorial/R/tables.html
http://r-statistics.co/Linear-Regression.html
https://stackoverflow.com/questions/43359050/error-continuous-value-supplied-to-discrete-scale-in-default-data-set-example/43359104
http://felixfan.github.io/ggplot2-remove-grid-background-margin/
https://rstudio-pubs-static.s3.amazonaws.com/228019_f0c39e05758a4a51b435b19dbd321c23.html#1_plot_one_variable_-_x:_continuous_or_discrete